Perf #856: inline volatile cancellation check at loop backedges #874

Merged: nickna merged 1 commit into main from wrk/issue-856-cancellation-inline on Jun 21, 2026

nickna (Owner) commented Jun 21, 2026

Summary

Closes part of #856 (the tracked "next lever" — the per-iteration $Runtime::CheckCancellation() call).

Every loop backedge emitted an unconditional call $Runtime.CheckCancellation(). RyuJIT won't inline CheckCancellation (it contains newobj+throw), so that bare call sat in every loop body as a per-iteration optimization barrier — measured at ~half the runtime of a tight numeric loop.

This inlines the field test on the hot path and only calls the throwing helper on the cold cancel path:

    volatile. ldsfld bool $Runtime::_cancelRequested
    brfalse   <loop body>                       // hot path: skip straight past
    call      $Runtime::CheckCancellation()      // cold: only when cancelling

The volatile. prefix is mandatory for correctness: _cancelRequested is loop-invariant, so a plain ldsfld could be hoisted out of the loop by RyuJIT's LICM — reading the flag once and never re-checking, silently reintroducing the #74 async-hang. The volatile read forbids the hoist and measured at zero cost. Only the loop-backedge emitter changed; the cold throw stays in the helper, and the dynamic-invocation recursion guard (EmitStackGuard) is left as a plain call (not per-iteration-hot; directly-compiled calls bypass it).
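
The shape of the change can be modeled outside IL. The sketch below is a TypeScript analogy, not the emitter's code: `runtimeState`, `checkCancellation`, and both factorial functions are hypothetical stand-ins. TypeScript has no `volatile.` prefix, so this shows only the control flow (inline flag test on the hot path, throwing helper on the cold path), not the memory-ordering guarantee. It also assumes the helper re-tests the flag before throwing, which the text above does not state.

```typescript
// Hypothetical stand-in for the runtime's static cancellation flag.
const runtimeState = { cancelRequested: false };

// Cold path: allocates and throws. Assumed to re-test the flag itself.
function checkCancellation(): void {
  if (runtimeState.cancelRequested) {
    const err = new Error("operation canceled");
    err.name = "OperationCanceledError";
    throw err;
  }
}

// Old shape: an unconditional helper call at every loop backedge.
function factorialCallEveryIteration(n: number): number {
  let acc = 1;
  for (let i = 2; i <= n; i++) {
    acc *= i;
    checkCancellation();
  }
  return acc;
}

// New shape: test the flag inline and call the helper only when it is set.
function factorialInlineCheck(n: number): number {
  let acc = 1;
  for (let i = 2; i <= n; i++) {
    acc *= i;
    if (runtimeState.cancelRequested) {
      checkCancellation();
    }
  }
  return acc;
}
```

With the flag clear, both shapes compute the same result; once the flag is set, the next backedge reaches the helper and throws.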

Results (branch vs main vs Node)

| Workload | main | branch | speedup | Node |
|---|---|---|---|---|
| factorial (tight numeric loop) | ~1027 ms | ~665 ms | 1.6× | ~210 ms |
| count-primes @100k (sieve) | ~403 ms | ~359 ms | 1.12× | ~264 ms |

The win scales with the loop's arithmetic fraction (biggest on pure-numeric loops, smaller where array ops dominate). No regression on other workloads.
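
As a cross-check, the ratios can be recomputed from the rounded timings. The snippet is only arithmetic on the figures above; `BenchRow`, `benchRows`, and `ratios` are illustrative names. Because the timings are approximate, the recomputed factorial speedup (about 1.54×) lands slightly under the quoted 1.6×.

```typescript
interface BenchRow {
  workload: string;
  mainMs: number;
  branchMs: number;
  nodeMs: number;
}

// Rounded timings copied from the results table.
const benchRows: BenchRow[] = [
  { workload: "factorial", mainMs: 1027, branchMs: 665, nodeMs: 210 },
  { workload: "count-primes @100k", mainMs: 403, branchMs: 359, nodeMs: 264 },
];

// speedup: main time divided by branch time.
// gapToNode: branch time divided by Node time (values above 1 mean slower than Node).
function ratios(row: BenchRow): { speedup: string; gapToNode: string } {
  return {
    speedup: (row.mainMs / row.branchMs).toFixed(2),
    gapToNode: (row.branchMs / row.nodeMs).toFixed(2),
  };
}

const ratioSummary: string[] = benchRows.map((row) => {
  const r = ratios(row);
  return `${row.workload}: ${r.speedup}x vs main, ${r.gapToNode}x of Node's time`;
});
// ratioSummary[0] is "factorial: 1.54x vs main, 3.17x of Node's time"
// ratioSummary[1] is "count-primes @100k: 1.12x vs main, 1.36x of Node's time"
```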

Correctness — verified directly

  • #74 cancellation (Test262 runner: compile-mode cooperative cancellation) preserved: a compiled while(true) loop ran for 1 s, then unwound with OperationCanceledException the instant _cancelRequested was tripped via reflection (the exact test-harness path). vm-timeout tests pass; Test262 Timeout=0.
  • dotnet test: 13994 pass. The Test262 baseline-drift failures (2 interpreted / 4 compiled, all in Array.isArray/Math/Proxy families) reproduce identically on main — pre-existing stale baseline, not from this change (verified by stash + rebuild + re-run). The 6 compiled-mode unit failures pass in isolation (parallel DLL-build contention flakiness).
  • SharpTS.TypeScriptConformance: green (codegen change can't affect the type-checker).

Follow-ups (not in this PR)

  • Throttle (per-loop counter, check every N) would reach the ~2.27× ceiling on pure-numeric loops, at the cost of a counter local.
  • Pure-numeric loops remain ~3× off Node due to separate levers: non-inlined user-function calls + boxed top-level vars.
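
For illustration, the throttle idea from the first follow-up could look roughly like the TypeScript model below. This is a sketch under stated assumptions, not code from this PR: `CHECK_INTERVAL` (the "N"), `throttleState`, `throwCanceled`, and the `onIteration` hook (which stands in for an external thread tripping the flag) are all hypothetical.

```typescript
// Hypothetical stand-in for the runtime's cancellation flag.
const throttleState = { cancelRequested: false };

// Hypothetical N: how many iterations pass between flag checks.
const CHECK_INTERVAL = 1024;

// Cold path: allocate and throw.
function throwCanceled(): never {
  const err = new Error("operation canceled");
  err.name = "OperationCanceledError";
  throw err;
}

function sumWithThrottledCheck(
  limit: number,
  onIteration?: (i: number) => void,
): number {
  let total = 0;
  let sinceCheck = 0; // the extra per-loop counter local
  for (let i = 0; i < limit; i++) {
    total += i;
    if (onIteration) onIteration(i);
    // Read the flag only once every CHECK_INTERVAL iterations.
    if (++sinceCheck === CHECK_INTERVAL) {
      sinceCheck = 0;
      if (throttleState.cancelRequested) throwCanceled();
    }
  }
  return total;
}
```

The trade-off this makes visible: cancellation can lag by up to `CHECK_INTERVAL - 1` iterations after the flag is set, in exchange for touching the flag far less often.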

The per-iteration `call $Runtime.CheckCancellation()` emitted at every loop
backedge sat in every loop body as an optimization barrier: RyuJIT won't
inline CheckCancellation (it contains newobj+throw), so the bare call was
measured at ~half the runtime of a tight numeric loop.

Inline the field test on the hot path and only call the throwing helper on
the cold cancel path:

    volatile. ldsfld _cancelRequested
    brfalse   <loop body>
    call      CheckCancellation()   // cold: only when cancelling

The volatile. prefix is mandatory: _cancelRequested is loop-invariant, so a
plain ldsfld could be hoisted out of the loop by LICM, reading the flag once
and never re-checking — silently reintroducing the #74 async-hang. Volatile
forbids the hoist at zero measured cost.

Results (branch vs main vs Node):
  factorial tight loop  1027 -> 665 ms (1.6x; Node 210)
  count-primes @100k     403 -> 359 ms (1.12x; Node 264)

Correctness: compiled while(true) unwinds with OperationCanceledException
the instant _cancelRequested is tripped via reflection (the #74 harness
path); vm-timeout tests pass; Test262 Timeout=0. Loop-backedge emitter only;
the dynamic-invocation recursion guard (EmitStackGuard) is left as-is.
@nickna nickna merged commit cde195a into main Jun 21, 2026
3 checks passed
nickna added a commit that referenced this pull request Jun 21, 2026
…llation-check inlining

Compiled output now meets/beats Node on 5/7 benchmark workloads; the two
stragglers (count-primes ~1.3x, factorial ~3x) are bounded by separate
non-codegen factors. Records the #874 inline-volatile loop-cancellation win
(1.6x tight loops / 1.12x sieve) and the rejected throttle variant.